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SUCCESSIVE NORMALIZATION OF RECTANGULAR ARRAYS 

By Richard A. Olshen-*^ and Bala Rajaratnam^ 

Stanford University 

Standard statistical techniques often require transforming data 
to have mean and standard deviation 1. Typically, this process of 
"standardization" or "normalization" is applied across subjects when 
each subject produces a single number. High throughput genomic and 
financial data often come as rectangular arrays where each coordinate 
in one direction concerns subjects who might have different status 
(case or control, say), and each coordinate in the other designates 
"outcome" for a specific feature, for example, "gene," "polymorphic 
site" or some aspect of financial profile. It may happen, when analyz- 
ing data that arrive as a rectangular array, that one requires BOTH 
the subjects and the features to be "on the same footing." Thus there 
may be a need to standardize across rows and columns of the rect- 
angular matrix. There arises the question as to how to achieve this 
double normalization. We propose and investigate the convergence 
of what seems to us a natural approach to successive normalization 
which we learned from our colleague Bradley Efron. We also study 
the implementation of the method on simulated data and also on 
data that arose from scientific experimentation. 

1. Introduction. This paper is about a method for normahzation, or 
regularization, of large rectangular sets of numbers. In recent years many 
statistical efforts have been directed towards inference on such rectangular 
arrays. The exact geometry of the array matters little to the theory that 
follows. Positive results apply to the situation where there are at least three 
rows and at least three columns. We explain difficulties that arise when 
either numbers only two. Scenarios to which methodology studied here ap- 
plies tend to have many more rows than columns. Data can be from gene 
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expression microarrays, SNP (single nucleotide polymorphism) arrays, pro- 
tein arrays, alternatively from large scale problems in imaging. Often there 
is one column per subject with rows consisting of real numbers (as in ex- 
pression). Subjects from whom data are gathered may be "afflicted," or not, 
with a condition that, while heritable, is far from Mendelian. A goal is to 
find rows, better groups of rows, by which to distinguish afflicted subjects 
from other subjects. One can be led to testing many statistical hypotheses 
simultaneously, thereby separating rows into those that are "interesting" for 
further follow-up and those that seem not to be. Genetic data tend to be 
analyzed by test "genes" (rows), beginning with their being "embedded" 
in a chip, perhaps a bead. There may follow a subsequent molecule that 
binds to the embedded "gene" /molecule. A compound that makes use of 
the binding preferences of nucleotides and to which some sort of "dye" is 
attached is then "poured." The strength of binding depends upon affinity 
of the "gene" or attached molecule and the compound. Laser light is shined 
on the object into which the test "gene" has been embedded; and from its 
bending, the amount of bound compound is assessed from which the amount 
of the "gene" is inferred. The basic idea is that a different afflicted status 
may lead to different amounts of "gene." 

With the cited formulation and ingenious technology, data may still suffer 
from problems that have nothing to do with differences between groups of 
subjects or with differences between "genes" or groups of them. There may 
be differences in background — by column, or even by row. Perhaps also 
"primers" (compounds) vary across columns for a given row. For whatever 
reasons, scales by row or column may vary in ways that do not enable 
biological understanding. Variability across subjects could be unrelated to 
afflicted status. 

Think now of the common problem of comparing variables that can vary 
in their affine scales. Because covariances are not scale- free, it makes sense to 
compare in dimensionless coordinates that are centered at 0, that is, where 
values of each variable have respective means subtracted off and are scaled 
by respective standard deviations. That way, each variable is somehow "on 
the same footing." 

Standardization, or normalization, studied here is done precisely so that 
both "subjects" and "genes" are "on the same footing." We recognize one 
might require only that "genes" (or some "genes") be on the same footing, 
and the same for "subjects." The successive transformations studied here 
apply when one lacks a priori opinions that might limit goals. Thus, "genes" 
that result from the standardization we study are transformed to have mean 
and standard deviation 1 across all subjects while the same is true for 
subjects across all "genes." How to normalize? One approach is to begin 
with, say, row, though one could as easily begin with columns. Subtract 
respective row means and divide by respective standard deviations. Now do 
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the same operation on columns, then on rows, and so on. Remarkably, this 
process tends to converge, even rapidly in terms of numbers of iterations, 
and to a set of numbers that have the described good limiting properties in 
terms of means and standard deviations, by row and by column. 

In this paper we show by examples how the process works and demon- 
strate for them that indeed it converges. We also include rigorous mathe- 
matical arguments as to why convergence tends to occur. Readers will see 
that the process and perhaps especially the mathematics that underlies it 
are not as simple as we had hoped they would be. This paper is only about 
convergence which is demonstrated to be exponentially fast (or faster) for 
examples. The mathematics here does not apply directly to "rates." The 
Hausdorff dimension of the limit set seems easy enough to study. Summaries 
will be reported elsewhere. 



2. Motivating example. We introduce a motivating example to ground 
the problem that we address in this paper. Consider a simple 3-by-3 matrix 
with entries generated from a uniform distribution on [0, 1]. We standardize 
the initial matrix X^^^ by row and column, first subtracting the row mean 
from each entry and then dividing each entry in a given row by its row 
standard deviation. The matrix is then column standardized by subtracting 
the column mean from each entry and then by dividing each entry by the 
respective column standard deviation. In this section, these four steps of row 
mean polishing, row standard deviation polishing, column mean polishing 
and column standard deviation polishing require one iteration in the process 
of attempting to row and column standardize the matrix. After one such it- 
eration, the same process is applied to resulting matrix X^^-* and the process 
repeated with the hope that successive renormalization will eventually yield 
a row and column standardized matrix. Hence these fours steps are repeated 
until "convergence" which we define as the difference in the Probenius norm 
between two consecutive iterations being less than 10~^. 

In order to illustrate this numerically, we start with the following 3-by-3 
matrix with independent entries generated from a uniform distribution on 
[0, 1] and repeat the process described above: 



(1) 



0.1182 0.7069 0.4145 
0.9884 0.9995 0.4648 
0.5400 0.2878 0.7640 



The successive normalization algorithm took 9 iterations to converge. The 
initial matrix, the final solution and relative (and log relative) difference for 
the 9 iterations are given below (see also Figure 1): 



(2) 



(final) 



-1.2608 1.1852 0.0756 
1.1852 0.0757 -1.2608 
0.0756 -1.2608 1.1852 
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Fig. 1. Relative differences at each iteration on the log scale — 3-by-3 dimensional exam- 
ple. 



The whole procedure of 9 iterations takes less than 0.15 seconds on a stan- 
dard modern laptop computer. We also note that the final solution has 
effectively 3 distinct entries. When other random starting values are used, 
we observe that convergence patterns can vary in the sense that convergence 
may not be monotonic. The plots in Figure 2 capture the type of convergence 
patterns that are observed in our simple 3-by-3 example. 

Despite the different convergence patterns that are observed, when our 
successive renormalization is repeated with different starting values, a sur- 
prising phenomenon surfaces. It always seems that the process converges, 
and, moreover, the convergence is very rapid. One is led naturally to ask 
whether this process will always converge and if so under what conditions. 
These questions lay the foundation for the work in this paper. 



Iteration no. Difference 



log (difference) 



(3) Successive Difference 



1 8.7908 

2 0.5018 

3 0.0300 

4 0.0019 

5 0.0001 

6 0.0000 

7 0.0000 

8 0.0000 

9 0.0000 



2.1737 
-0.6895 
-3.5057 
-6.2862 
-9.0607 
-11.8337 
-14.6064 
-17.3790 
-20.1516 




Fig. 2. Convergence patterns for the 3-by-3 example. 



3. Preliminaries. We establish the notation that we wih use by re- visiting 
a normahzation/standardization method that is traditional for multivariate 
data. If the main goal of a normalization of a rectangular array is achieving 
zero row and and column averages, then a natural approach is to "mean 
polish" the row (i.e., subtract the row mean from every entry of the rectan- 
gular array), followed by a column "mean polish." This cycle of successive 
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row and column polishes is repeated until the resulting rectangular array 
has zero row and and column averages. The following theorem proves that 
this procedure attains a double mean standardized rectangular array in one 
iteration where an iteration is defined as constituting one row mean polish 
followed by one column mean polish. 

Lemma 3.1. Given an initial matrix an iterative procedure to cycle 
through repetitions of a row mean polish followed by a column mean polish 
until convergence terminates in one step. 

Proof. Let be an n X k matrix, and define the following: 

Now the first part of the iteration, termed as a "row mean polish," sub- 
tracts from each element its respective row mean: 

The second step of the iteration, termed a "column mean polish," sub- 
tracts from each element of the current matrix its respective column mean: 

where 



X 



n 



=1 

After the second step of the iteration it is clear that the columns sum to 
zero; the previous operation enforces this. In order to prove that the iterative 
procedure terminates at the second part of the iteration it is sufficient to 
show that the rows of the current iterate sum to zero. Now note that 



= [xS']-xS' 

n 



n ■ 

r=l 
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It remains to show that the row sum of this matrix X^^^ expressed as the 
elements of X^'') sum to zero. So 



k 



(2) 



(0) 



(0). 




Z 



{kXl 



(0) 



(0) 



^ n k 
r=l j=l 



{0)^ 



(0)> 



= {kX\ 

= 0-0 
= 0. 

Note that the above double standardization is implicit in a 2-way ANOVA, 
and, though not explicitly stated, it can be deduced from the work of Scheffe 
[9]. It is nevertheless presented here, first, in order to introduce notation; 
second, as it is not available in this form above in the ANOVA framework; 
and third, for the intuition it gives since it is a natural precursor to the 
subject of work in the remainder of this paper. □ 

3.1. Example 1 (cont.): A 3-by-3 example with only mean polishing. We 
proceed to illustrate the previous theorem on the motivating example given 
after the introduction and draw contrasts between the two approaches. As 
expected the successive normalization algorithm terminates in one iteration. 
The initial matrix, the final solution and the column and row standard 
deviations of the final matrix are given below: 



(4) 



y{0) 



0.1182 0.7069 0.4145 
0.9884 0.9995 0.4648 
0.5400 0.2878 0.7640 



^^■j Y (column-polished) 



-0.4307 
0.4396 
-0.0089 



0.0422 
0.3347 
-0.3769 



-0.1333 
-0.0829 
0.2162 



-0.2568 
0.2091 
0.0477 



(7) ^^^(co/umns) = [0.1932 0.2311 0.2410 

^0.1952 

(8) Std{rows) = 0.2257 

0.2445 



0.2161 0.0407 
0.1043 -0.3134 
-0.3204 0.2727 
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We note that, unhke the motivating example, and as expected, the row and 
column means are both 0, but the standard deviations of the rows and the 
columns are not identical, let alone identically 1. Since mean polishing has 
already been attained, and we additionally require that row and column 
standard deviations to be 1, it is rather tempting to row and column stan- 
dard deviation polish the terminal matrix yC/*""') above. We conclude this 
example by observing the simple fact that doing so results in the loss of the 
zero row and column averages. 

3.2. The 2-hy-2 problem. We now examine the successive row and col- 
umn mean and standard deviation polishing for a 2 x 2 matrix and hence 
illustrate that for the results in this paper to hold true, the minimum of row 
{k) and column dimension (n) of the matrix under consideration must be at 
least 3, that is, min(A:,n) > 3. Consider the following general 2x2 matrix: 



a h 
c d 



If a < 6 and c< d, then after one row normalization, 

\-l 1 

so the variance of its values, denoted by (Sj-^^)^, is 0. Therefore, allowing 
for both inequalities to be reversed, and, assuming that, for example, a, b, 
c and d are i.i.d. with continuous distribution(s), then P{{S^p)'^ = 0) = 1/2; 
in which case the procedure is no longer well defined. 

A moment's reflection shows that if X is ?i x 2 with n odd, then after each 
row normalization, each column has an odd number of entries; each entry 
being —1 or +1. However, each row has exactly one —1 and one +1. Thus 

occurs only on a set of product Lebesgue measure 
0. With min(A:,n) > 3, both tend to 1 a.e. as m oo. Henceforth, assume 
min(A:, n) > 3. 

4. Theoretical considerations. For a matrix X as defined, take x S 7^'"^'^. 
Let A denote Lebesgue measure on "JZ"-^^ and P be the probability on the 
coordinates that renders them i.i.d. That is, Xjj(xjj) = Xjj. Xjj are i.i.d. 
-/V(0, 1). What is more, for any orthogonal linear transformation O that pre- 
serves orthogonality, OX ~ X, that is, OX is distributed as X (see Anderson 
[1] and Muirhead [8]). 

Because A and P are mutually absolutely continuous, if C = {algorithm 
for successive row and column normalization converges}, then P{C) = 1 iff 
\(jinxk \^ (j-j _ though only one direction is used. 
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For the remainder, assume that P governs X. Positive results will be ob- 
tained for 3 < min(n, /c) < max(n, A;) < oo. Our arguments require notation, 
to which we turn our attention now. We redefine the notation and sym- 
bols introduced in Lemma 3.1 as we now bring in row and column standard 
deviation polishing. Define X = X^'^^ : 

1=1 

XW = [X«], where XS]) = (X,, 

a.s. (Sf^)2>0 since k>3>l. 
By analogy, set 

X(^) = [xSf] where xg' = (X^ - X,)/ S<"; 




i=l 

Arguments sketched in what follows show that a.s. (sj^^)^ > since n > 3. 

For m odd, X^^ = {X^-'^ - xf'^)/ S^''^ with X^"'^ and {S^""-'^)^ 
defined by analogy to previous definitions. 

l^or m even, X^-^- = (X^-^- — X.^- )/ ^^j > agam with X.^- and 
g{"^ 1) (igfii^ed by analogy to earlier definitions. 

Back to the general problem. We first note that because the process we 
study is a coordinate process, there is no difference between regular con- 
ditional probabilities and regular conditional distributions (see Durrett [4], 
Section 4.1.c, pages 33 and 229-331 for more details). They can be com- 
puted as densities with respect to Lebesgue measure on a finite Cartesian 
product X Sph(g) where q = k refers to after row normalization and where 
q = n ii subsequent to column normalization. In a slight abuse of notation. 



,X(0))/Sf); 
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for any positive integer q we define Sph(q) = {x € TZ'^ : ||x|p = q} wliere tlie 
norm is the usual Euclidean norm. 

Let {rij -.1 = 1, . . . ,n;j = 1, . . . ,k} be a set of n • A; rational numbers, not 
all 0. Obviously, 

An inductive argument involving conditional densities shows that 

(9) ^(u U E-X"^^=o)=o- 

\-m=lri,...,r„fe j,j / 

Consequently 

(/ oo n \ / OD k \\ 

\m-li=l / Vm=lj=l // 

Further, a.s. X^'") is defined and finite for every m. 

What we know that the t distribution (Efron [5]) and geometric argu- 
ments require that X^^^ can be viewed as having probability distribution 
on 'JZ"-^^ that is the n-fold product of independent uniform distributions 
on Sph(/c) X • • • X Sph(/c). For the sake of clarity we note again that condi- 
tional probabilities are always taken to be "regular" and concentrated on the 
relevant product of spheres. Readers will note that in the cited arguments 
for j J'ijX.™'', (Sj^^™'')^ and (sj^'" ^^)^, relevant conditional probabili- 
ties and densities are used. Of course, after the ?nth row standardization 
-j^(2m-i) _ [x|^™ :i = 1, . . . ,n;j = 1, . . . ,k], and analogously for column 
standardization and X^^™). 

As an aside, write 51 (X) = X^^^^ on {min(sj''^)^ > 0}. Elsewhere, let 51 (X) = 
0. Necessarily gi depends on X only through its direction in TZ"'^^. Equiv- 
alently, gi is a function of X that is homogeneous of degree 0. Moreover, 
51 (X) ~ (7i(OX) for all orthogonal linear O on TZ^^^ . X^^^ has independent 
rows, each uniformly distributed on Sph(A;). 

We turn now to study X^^™" as m increases without bound. Note first 
that for m = 1, 2, . . . , X^^™"^) has joint distribution that is unchanged if 
two columns of X are transposed, therefore if two columns of X^'^™"^) are 
transposed. Since every permutation is a product of transpositions, X^^"^"^) 
is column exchangeable. Each is also row exchangeable. 

Write TT for a permutation of the integers {1, . . . , A;}; let 11 be the finite a- 
field of all subsets of {vr}. The marginal probability induced on {vr} from the 
joint distribution of (X, {vr}) is discrete and uniform, assigning probability 
l/k\ to each vr. 
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Write ^aSj^i to be the a-field, 

7-([(xSf)' :i = 1, . . . , fc; g = 2m - 1, 2m + 1, . . .]) X n. 

Theorem 4.1. E{{xfJ^^^)^\g2n.-i} = i^tTi)^^)^ M m = 1,2,.... 

Proof. Write {^fj^^^^^f = Eti(x!?""'^)'/[.(i)=«] where Ia is the in- 
dicator function of the event A. Obviously, (X^^^^-^ ^^)^ is ^2m-i measurable; 
{0^91,92} is a set of real numbers, doubly indexed by {og^^gj) € F; F is a finite 
subset of {1, . . . , A:} X {1,2, . . .}. Form B = < ag,,g,; (91, (72) G 

Note that Gj^2m-i) generated by {B x Q}, B of the cited form and 

(i) 

Q G n. In particular, each B x Q ^ ^2m-i- Proof of our claim is complete if 
we show that for m = 2, 3, . . . , 

f , {2m-l).2 _ f .^(1) x2 
JBxQ ^ JBxQ ^ ' 

The left-hand side of the display can be expressed as 

^V^[7r(l)=«]-f[7reQ]^B| = ^ | J^l^Sf ^ V^[7r(l)=«] 4gQ] | ■ 

Now, for any tt, the expression inside the sum is A; if vr S Q &rid if not. 
So, the expectation factors into P{B)P{Q). Retracing steps shows clearly 

that (Xj^^^™ m the computation just completed could be replaced by 

(X^j"^^)^ with equalities remaining true. The claim is now proven. □ 

Convergence of (X^J^^)^. The backwards martingale convergence theo- 
rem (see Doob [3]) entails that (X^^^^^ ^^)^ converges a.s. as m — > 00. So, for 
each fixed j G {1,...,A;}, (X^^^™^ "^^)^/[7r(i)=j] converges a.s. It follows that 

[(xj^"* i)-j2j converges a.s. as m — )■ 00. 

If previous arguments are perturbed so that tt denotes a permutation of 
{1, . . . ,n} with (X«^))2 r^pi^^^d by {y^%^f,G^l_, by Q^l (xg^^ by 

(Xg™^).)^ and Ef=i(X?""'^)2 by E^=l(xS^^)^ then one concludes that 

also [(X^^™"^)^] converges a.s. as m^oo. Without further argument it is 
unclear if the a.s. limits along odd, respectively even, indices are the same; 
and it is crucial to what remains that this is in fact true. 
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Obviously, r\m=i^2m-i ~ nm=i ^2m' ^° ^ Certain sense measurability 
is the same. Obviously, too, randomization of index is by columns in the first 
case and by rows in the second. But now a path to the required conclusion 
presents itself. Given our success in proving a.s. convergence along odd in- 
dices after randomizing columns and along even indices randomizing rows, 
and given that a requirement of our approach is that these two limits be iden- 
tical a.s., perhaps there is a path which would allow for the simultaneous 
randomizing of both columns and rows. Fortunately, that is the case. Thus, 
let TTi be a permutation of {1, . . . , n} and tt2 be a permutation of {1, . . . , /c}. 
With obvious product formulation of governing probability mechanism and 
further obvious formulation of decreasing u-fields, as an example of what 
can be proved, 

^((<W.(i))'|e2) = (X?),).^(,/ a.s. 

From this arguments for the display there are several paths by which one 
concludes that a.s., simultaneously for all (i, j), [(xj™"^)^] converges. Domi- 
nated convergence requires that the limit random matrix has expectation 1 
in all coordinates. As a consequence of this convergence, a.s. and simultane- 
ously for all (i, j), [(xg"^+^))2] - [(Xg™))2] ^ as m ^ oo. 

Convergence of [xj™^]. We turn now to key ideas in extending our argu- 
ment that [(X^J*^)^] converges almost surely simultaneously to the same con- 
clusion with the square removed. To limit notational complexity, we study 
first only odd indices as m grows without bound. Conclusions are identical 
for even indices, and by extension for indices not constrained to be odd or 
even. 

A first necessary step is to show that for arbitrary j, 

pfe(sf"^+^V>o| = i. 

To that end, let A be the event [limm,(5j^™ ^^)^ = 0]. Obviously, A = 
{x: {S^^^ ^^)^ — > 0} = {x:^^^™ — )• 0} regardless of square roots taken. 
We show that P{A] = 0. 

By way of contradiction, suppose that > 0. Write 

n 
1=1 

We know that the first term tends to 1 a.s. on A. Therefore, also the 
second term tends to 1 a.s. on A. Since for m > 1, x[^™ is bounded a.s., 
linim max; (X^^"^ ) (x) is a finite- valued random variable C = C (x) . Simple 
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considerations show that the only possibihties are that for aU /, C = +1 or 
C = —1 a.s. Similar arguments show that lim^ min^ (xj^™ is +1 or 

— 1. It is clear that there is no sequence {tti^} of {?n} along which 

-1 <limmin(xSJ'"'=~^^)(x) <lh^^max(xSJ'"'^~^^)(x) < 1. 

ml m I J 

It follows that limrnXj-^™ exists a.s. on A and that the limit of the 
sequence is +1 or —1 on A. 

Recall that X ~ — X, and this equality is inherited by all joint distri- 
butions of X^'") . However, when X is replaced by its negative, any limit of 
-j^(2m 1) |-,g(,Qj^^gg negative; a.s. convergence is a property of the probabil- 
ity distribution of X, and {x : x[^"^ converges} = {x : — X^^"^ converges}. 
Therefore, the only possibility is that (X^J™""^'')^ and (s]^'""^'')^ 1. 
The upshot is that P{A} = 0; > 0} = 1. 

Again, let us fix j. Consider a sample path of {X^^'"''"^'} along which 

lim„(5]^"''"^^)2 = D>0. Clearly, {i :Ik^^„,jxg™«"^^| > 0} / 0. Indeed, let 

E = E{j) = {i:for some {ruq} = {mq{i)} ,\mi^^{j^\x!l^'' ^\ > 0}. Row and 

column exchangeability of X^™) necessarily means that the cardinality of E 
is at least 2. 

Let io 7^ ii G E. Because min(n, k) > 3, there is a further subsequence of 
{{mq{i)}} — again, for simplicity, write it as {niq} — along which 

lim iXp'"' ^^1 and lim IX.^™''^ I both exist; 
limlX,-^™' ^^1 and lim IX.^"*'^ I both exist and 

mq niq 

lim IX,-^'"'^ — Xp™' ^^1 exists and is positive. 

mq ' w ' ^ 

The first two requirements can always be met off of the set of probability 
implicit in (9). That the third can be met as well is a consequence of the 
argument just concluded. In any case, if there were no such subsequence, 
then our proof would be complete because all X^^™"^ for j fixed tend to the 
same number. But now, write 

,^{2mq) ■^{2mq-l)-. ,„(2mq) „(2mg-l)s 



^{2mq-l) „(2m,j-l) w j^(2mg--l) . , <~,(2m,-l) 



Since (X.^™'^)^ — (X^^^™' ^^)'^ a.s., and likewise with iq replaced by ii, 
the first expression of the immediately previous display has limit 0. Thus, so 
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too does the second expression. This is possible only if S'j ' — > 1 (where 
we have taken the positive square root). Further, 

„(2mq-l) ,^(2m,-l) ^{2mg~l) 
W ioj j^(2mg-l) 

As a corollary to the above, one sees now that x^J'"'' — > g. Since the 
original {rrig} could be taken to be an arbitrary subsequence of {m}, we 
conclude that: 

. sf ^'-'Ul a.s., 

• X.^J"''"^^ ^ a.s. and 

X(m) 
Ij converges a.s. 

Now replace arguments for (i) and (ii) on columns by analogous arguments 
on rows. Deduce that every infinite subsequence of positive integers has a 
subsequence along which our desired conclusion obtains. 

5. Properties of successive normalization. We now comment on theo- 
retical properties of successive normalization. In particular, we elaborate 
on the generality of the result by showing that the Gaussian assumption is 
not necessary and serves only as a convenient choice of measure. We also 
discuss convergence in Lebesgue measure and the domains of attraction of 
successive normalization. 

5.1. Choice of probability measure. Write A for Lebesgue measure on 
j^nxk^ Thus X G T^'T-x'^ is an n X /c rectangular array of real numbers. Let 
P be a measure on the Borel sets of B of TZ^^^ that is mutually absolutely 
continuous with respect of A. By this we mean that \i B d 7^"^'^ is Borel 
measurable, then \{B) = iff P{B) = 0. One obvious example of such a 
P is the measure on B that renders its nk coordinates i.i.d. Gaussian. If 
is the (r, c) coordinate of X G 7^"^^, and P^^'*^) is the marginal measure 
of P on its (r,c) real coordinate, then P'^'^'^\{—oo,xq]) = <I>(xo) where <I> 
is the standard normal cumulative distribution function. This is what we 
mean by P henceforth. The i.i.d. specification requires that P is an n/c-fold 
product of identical probabilities, and because P is now defined uniquely 
for all product rectangles, it is defined uniquely for all B ^B. It is obvious 
that $ being mutually absolutely continuous with respect to one-dimensional 
Lebesgue measure means that the product measure P is mutually absolutely 
continuous with respect to the n/c-fold Borel product measure A. 

Now, let /i,/2,... be a sequence of ;B-measurable functions, TZ"-^^ ^ 
j^nxk^ Then {/^ converges} = {lim/m = lim fm simultaneously for all nk co- 
ordinates}. When /m(x) converges, then lim /^(x) = (hm/m(x))/{/„ converges}, 
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where for any set C, Ic is its indicator. Because each fm is S-measurable, 
hm/m(x) is also jB-measurable. If we are given that {fm} converges P- 
almost everywhere, then {fm converges} is a Borel set of P-measure 1. Its 
complement has P-measure and, therefore, A measure 0. It follows that 
{fm} converges except for a Borel subset of TZ^^^ of Lebesgue measure 0. 

In the present paper, the (r, c) coordinate of {fm} is the set of succes- 
sive normalizations of the initial real entry multiplied by the indicator of 
the subset of Jl"-^^ that is {successive normalizations possible}. The latter 
is clearly a Borel subset of TZ^^^ of P-measure 1, so its complement is a 
Borel set of A-measure 0. It is immaterial for convergence whether the first 
initialization is by row or by column. Readers should note that any study of 
asymptotic properties of {fm} under P may show that {/^(x)} has proper- 
ties such as row and column exchangeability, where the coordinate functions 
are taken to be random variables. They should note, too, that: (i) changing 
the original measure to one more conducive to computation is standard, and 
is what happens with "exponential tilting" as it applies to the study of large 
deviations of sums; (ii) the interplay between measure and topology, as is 
utilized here, is a standard approach in probability theory that is applied 
seldom to statistical arguments; the lack thus far owes only to necessity. 

5.2. Convergence in pth mean for Lebesgue measure. Whenever stan- 
dardization is possible, after one standardization the sum over all nk co- 
ordinates of squares of respective values is bounded by nk, it follows from 
dominated convergence that as m grows without bound, each term converges 
not only P-almost everywhere but also in pth. mean for every oo > p > 1 so 
long as P remains the applicable measure. Because A is not a finite mea- 
sure, convergence in pth mean for underlying measure A does not follow. It 
is impossible for such convergence to apply to Lebesgue measure on the full 
Euclidean space W'"'. This is because there is a set of positive Lebesgue 
measure whose members are not fixed points for normalization by both rows 
and columns. No matter which normalization comes first in any infinite, al- 
ternating sequence, the normalization is invariant to fixed scale multiples of 
each X G TZ"-^^, and, in particular, to fixed scale multiples of x in the set of 
positive Lebesgue measure just described. Obviously, no further arguments 
are required; and convergence in pth mean is immediate if Lebesgue measure 
A is restricted to a bounded subset of TZ"'^^. 

5.3. Domains of attraction. Reviewers of research presented here have 
wondered if we can describe simply what successive and alternating normal- 
ization does to rectangular arrays of data, beyond introductory comments 
about putting rows and columns on an equal footing and the analogy to 
computing correlation from covariance. We begin our reply here though 
details await further research and a subsequent paper. Please remember 
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invariance to either row or column normalization (when possible) to scale 
multiples of x G TZ^^^ . In other words, results of normalization are con- 
stant along rays defined by these multiples, and without loss of generality 
we can assume that x lies in an ?i-fold product of Sph(A;) where each of the 
n components is orthogonal to the linear one-dimensional subspace of TZ'^ 
consisting of its equiangular line; call it Sph(A;) \ {1}. Thus, without loss 
we assume that the object under study is X^"^^ subject to a row normal- 
ization of the process of successive normalization. We study only subsets 
of the n-fold product space that was described and that is complementary 
to {Um=i Uri,...,rfe Sij '^iJ^^™^ =0} where {rjj} are rational. The set in 
braces just before the comma has P-measure 0, and we know that on its 
complement, X^™) can be defined for all m = 1, 2, 

Because normalization always involves subtraction of a mean and divi- 
sion by a standard deviation and because each X^™) is row and column 
exchangeable, the limiting process we study here when P applies seems, at 
first glance, to be analogous to "domains of attraction" as that notion ap- 
plies to sequences of i.i.d. random variables. One obvious difference is that 
here limits are almost sure, unlike distribution. While a.s. limits of X^™) 
are shown to have row and column means and row and column standard 
deviations l,n x k arrays of real numbers with this property are obviously 
the only fixed points of the alternating process studied here. The Hausdorff 
dimension of the set of fixed points is not difficult to compute, but we have 
been unable thus far to give rigorously supported conditions for the domain 
of attraction (in the sense described) of each fixed point. The simple case 
for which domains of attraction for limits in distribution were described was 
a major development in the history of probability (see Feller [6], Gnedenko 
and Kolmogorov [7], Zolotarev [10]). We report some intuitive results, and 
next we look at a mathematical question that arose in our study of domains 
of attraction for which at present we have only a heuristic argument. 

Is there a set C W""^ for which P{E) = 1; X("')(x) converges for E, 
and for each fixed i lim Xjj (x) 7^ lim Xjj/ (x) for all j ^ j' if x S E7 Clearly 
one could ask the equivalent question for each fixed j, and a correspond- 
ing subset E' C TZ^^^. We conjecture the existence of E n E' . However, 
arguments available thus far do not confirm existence rigorously. Therefore, 
suggestions regarding domains of attraction as X^"^) grows without bound 
should be taken as only heuristic for now. 

Given that {S^""^)^ 1 on a subset of TZ^^^ with complementary P- 
measure 0, therefore Lebesgue measure 0, almost surely ultimately [meaning 
for m = m(x) large enough], row normalization does not perturb the (strict) 
sort of any row. By analogy, almost surely ultimately column normaliza- 
tion does not perturb the (strict) sort of any column. Thus, a.s. ultimately, 
successive and alternative row and column normalization do not perturb 
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any row or column sort. This joint strict sort determines an open subset of 
-j^nxk^ If we restrict ourselves to rows, then we may speak more precisely of 
n-fold products of sorts of members of Sph(A:) \ {1}. For a fixed row there 
are A;! strict sorts of the squared entries. Given a fixed sort of the squares 
there are, say, f{k) assignments of sign to the square roots of the k entries 
so that the sum of entries for that row is 0. For any assignments of signs 
to the necessarily nontrivial values of the k squares that renders the sum of 
entries for the row 0, there is always a scaling so that the sum of squares 
is any fixed value so that the variance of the set of numbers in that row is 
1. One computes /(3) = 2; /(4) = 4; and so on. Computation of the exact 
value of f{k) is slightly tricky and is not reported here. For all k, f{k) > 2. 
To summarize, for each fixed row there are a.s. ultimately f{k)k\ disjoint 
(open) invariant sets following row normalization, making for a.s. 
ultimately (open) invariant sets simultaneously for all n rows. If we count 
a.s. ultimately invariant sets subsequent to column normalization, then en- 
tirely analogous arguments result in a.s. ultimately [/(n)?!!]*^ disjoint (open) 
sets simultaneously for all columns. 

From computations, after a row normalization the surface area of the 
sphere in /c-space orthogonal to the equiangular line — that corresponds to 

only one row of X^™) — has surface area ~ '^/(fO(fr|) < 1 for ^ ^ 21. The 
expression Q as k oo. Even for k = 4, the quantity is only about 14.7 
(larger than the actual value). Remember that there are at most [f{k)k\] 
a.s. ultimately nonempty "invariant sets" for row normalization. Thus one 

sees that for k large the quantity ( lin^frnt sTof tph(fc)l ''''^^^y °- 

6. Computational results and applications. We include three examples 
to highlight and illustrate some computational aspects of our iterative pro- 
cedure. The first two examples are studies by simulation whereas the third 
example is an implementation on a real dataset. 

For the simulation study, we consider a 3-by-3 matrix and a 10-by-lO 
matrix both with entries generated independently from a uniform distribu- 
tion on [0, 1]. For a given matrix, the algorithm computes/calculates the 
following 4 steps at each iteration: 

(a) mean polish the column, 

(b) stand deviation polish the column, 

(c) mean polish the row and 

(d) stand deviation polish the row. 

These fours steps, which constitute one iteration, are repeated until 
"convergence" — which we define as when the difference in some norm-squared 
(the quadratic/Frobenius norm in our case) between two consecutive it- 
erations is less than some small prescribed value — which we take to be 
0.00000001 or 10"^. 
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6.1. Example 2: Simulation study on a 10-by-lO dimensional example. 
We proceed now to illustrate the convergence of the successive row and col- 
umn mean-standard deviation polishing for the simple 10-by-lO dimensional 
example cited. The algorithm took 15 iterations to converge. The initial ma- 
trix, the final solution, and relative (and log relative) difference for the 15 
iterations follow: 



(10) 



Xfi' 



nal 



(11) 



/0.8145 
0.7891 
0.8523 
0.5056 
0.6357 
0.9509 
0.4440 
0.0600 
0.8667 

V0.6312 

0.2486 
0.4516 
0.2277 
0.8044 
0.9861 
0.0300 
0.5357 
0.0871 
0.8021 
0.9891 

/ 1.2075 
-0.0736 
0.8858 
-0.9296 
-0.8358 
1.4926 
-0.5156 
-1.8680 
0.3596 
0.2771 



0.3551 
0.9970 
0.2242 
0.6525 
0.6050 
0.3872 
0.1422 
0.0251 
0.4211 
0.1841 

0.0669 
0.9394 
0.0182 
0.6838 
0.7837 
0.5341 
0.8854 
0.8990 
0.6259 
0.1379 



0.7258 
0.3704 
0.8416 
0.7342 
0.5710 
0.1769 
0.9574 
0.2653 
0.9246 
0.2238 

0.2178 
0.1821 
0.0418 
0.1069 
0.6164 
0.9397 
0.3545 
0.4106 
0.9843 
0.9456 



0.3736 
0.0875 
0.6401 
0.1806 
0.0451 
0.7232 
0.3474 
0.6606 
0.3839 
0.6273 

0.6766 
0.9883 
0.7668 
0.3367 
0.6624 
0.2442 
0.2955 
0.6802 
0.5278 
0.4116 



0.0216 
0.9106 
0.8006 
0.7458 
0.8131 
0.3833 
0.6173 
0.5755 
0.5301 
0.2751 



6026 \ 

7505 

5835 

5518 

5836 

5118 

0826 

7196 

9962 

3545/ 



0.2139 
1.7222 
-0.8659 
1.5223 
0.8041 
0.1374 
-0.7494 
-1.1055 
-1.0158 
-0.6632 



0.8939 
-1.2202 

0.8816 

0.6537 
-0.7288 
-1.2120 

1.5647 
-0.6428 

0.8070 
-0.9973 



0.2661 
-1.0461 
0.7930 
-0.7661 
-2.0057 
1.1351 
0.2025 
0.9269 
-0.5547 
1.0490 



-2.0026 
0.6465 
0.9515 
0.9476 
1.0328 

-0.5035 
0.5610 
0.3515 

-1.1339 

-0.8509 
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-0.5881 


-1.2477 


-0.4157 


1.1023 


0.5705 \ 


-0.8172 


0.5144 


-1.1740 


1.3022 


0.1458 


-0.9498 


-1.6621 


-1.1469 


1.0831 


0.0298 


0.9361 


0.5467 


-1.3402 


-1.3775 


-0.1931 


1.4824 


0.5929 


0.0202 


0.2768 


-0.6390 


-1.2741 


0.1766 


1.3125 


-1.1642 


-0.1005 


0.2646 


1.3840 


-0.0059 


-0.7521 


-1.9537 


-0.8323 


1.2448 


0.1167 


0.8827 


0.9259 


0.1669 


-0.5895 


1.0581 


-1.0805 


1.9828 


1.6114 


-0.9601 


1.5752 


-0.2727 


-0.7685/ 



Successive Difference 

(12) 

/ Iteration no. Difference log(difference) \ 



1 


84.1592 


4.4327 


2 


1.2860 


0.2516 


3 


0.1013 


-2.2897 


4 


0.0144 


-4.2402 


5 


0.0029 


-5.8434 


6 


0.0007 


-7.2915 


7 


0.0002 


-8.6805 


8 


0.0000 


-10.0456 


9 


0.0000 


-11.4000 


10 


0.0000 


-12.7492 


11 


0.0000 


-14.0955 


12 


0.0000 


-15.4403 


13 


0.0000 


-16.7841 


14 


0.0000 


-18.1272 


15 


0.0000 


-19.4699 



We note once more how tlie relative differences decrease linearly on the log 
scale (though empirically) and how this is again suggestive of the rate of 
convergence. As both the figure (see Figure 3) and the vector of relative 
differences indicate, there is a substantial jump at iteration 2, and then the 
curve behaves linearly. 

The whole procedure takes about 0.37 seconds on a standard modern 
laptop computer and terminates after 15 iterations. It might appear that the 
increase in the number of iterations increases with increase in dimension. For 
instance, the number of iterations goes from 9 to 15 as we go from dimension 
3 to 10. We should, however, bear in mind that when we go from dimension 
3 to 10 the "tolerance level" is kept constant at 0.00000001. The number 
of elements that must be close to their respective limiting values, however, 
goes from 9 in the 3-dimensional case and to 100 in the 10-dimensional case. 
The rapidity of convergence was explored further, and the process above 
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was repeated over 1000 simulations. The convergence proves to be stable in 
the sense that the mean and standard deviation of the number of steps until 
convergence over the 1000 simulations are 14.5230 and 2.0331, respectively. 
A histogram of the number of steps till convergence is given below (Figure 4). 

A closer look at the vector of successive differences suggests that the "bulk 
of the convergence" is achieved during the first iteration. This seems reason- 
able since the first steps render the resulting columns, then the rows and the 
members of their respective unit spheres (and even within cited subsets of 
them). Convergence then is only within these spheres. Our numerical results 
also indicate the the sequence of matrices X^*) changes most drastically dur- 
ing this first iteration. It suggests that mean polishing to a larger extent is 
responsible for the rapidity of convergence and is reminiscent of the result 
in Lemma 3.1 which states that if only row and column mean polishing are 
performed, then convergence is achieved immediately. We explore this issue 
further by looking at the distance between X^^'^ and X^^"'"'^^ and compare it 
to the distance between X^^^ and X^^'^°'^\ The ratio of these two distances 
follows: 

„ . dist(X(i),X(-^"'^')) 

Ratio = ; — T-TT j-z — 

dist(X(o),X(/^"'^0) 

For our 10-by-lO example, we simulated 1000 initial starting values and 
implemented our successive normalization procedure. The average value of 



5 




-20' ' ' ' 

5 10 15 

X (Iteration number) 



Fig. 3. Relative differences at each iteration on the log scale — 10-by-lO dimensional ex- 
ample. 
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the distance from the first iterate to the hmit as a proportion of the total 
distance to the hmit from the starting value is only 2.78%. One could in- 
terpret this heuristically as saying that on average the crucial first step 
does as much as 97.2% of the work towards convergence. We therefore 
confirm that the bulk of the convergence is indeed achieved in the first 
step (termed as a "one-step analysis" from now onwards). The distribu- 
tion of the ratio defined above is graphed in the histogram below (Fig- 
ure 5). We also note that none of the 1000 simulations yielded a ratio of 
over 10%. 

Yet a another illuminating perspective of our successive normalization 
technique is obtained when we track the number of sign changes in the 
individual entries of the matrices from one iteration to the next. Please 
remember that this is related to the "invariant sets" that were described in 
Section 5.3. Naturally, one would expect the vast majority of sign changes 
to occur in the first step as the bulk of the convergence is achieved during 
this first step. We record the number of sign changes at each iteration, as 
a proportion of the total number of sign changes until convergence, over 
1000 simulations, in our 10-by-lO case. The results are illustrated in the 




8 10 12 14 16 18 20 22 24 26 

Number of iterations to convergence(%) 



Fig. 4. Distribution of the number of steps to convergence. 
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table below^ (see Table 1). An empirical study of the occurrence of the sign 
changes reveals interesting heuristics. We note that on average 95% of sign 
changes occur during the first step and an additional 3% in the next step. 
The table also demonstrates that as much as 99% of sign changes occur 
during the first three iterations. When we examine infrequent cases where 
there is a sign change well after the first few iterations, we observe that 
the corresponding limiting value is close to zero thus indicating that a sign 
change well into the successive normalization technique (i.e., a change from 
positive to negative or vice versa) amounts to a very small change in the 
actual magnitude of the corresponding value of the matrix. 

We conclude this example by investigating more thoroughly whether the 
dimensions of the matrices have an impact on either the rapidity of conver- 
gence and/or on the one-step analysis. The following table gives the mean 



■^Since the number of iterations to convergence depends on the starting point, the length 
of the vector of the number of sign changes will vary accordingly. We summarize this vector 
by averaging over all the 1000 simulations the relative frequency of the number of sign 
changes for the first nine iterations. The first nine iterations were chosen as each of the 
1000 simulations required at least 9 iterations to converge. 




01 23456789 10 

Ratio(%) 



Fig. 
from 



5. Distribution of distance to limit after 1-step as a proportion of distance to limit 
initial-step. 
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Table 1 

Distribution of the occurrence of sign changes 



Iteration no. 


Relative frequency Relative cumulative frequency 


1 


94.97% 




94.97% 


2 


3T5% 




98.12% 


3 


1.03% 




99.15% 


4 


0.40% 




99.55% 


5 


0.20% 




99.75% 


6 


0.12% 




99.87% 


7 


0.05% 




99.92% 


8 


0.03% 




99.95% 


9 


0.02% 




99.97% 


10 and above 


0.03% 




100.00% 


Total 


100.00% 








Table 2 




Rapidity of convergence and one-step 


analysis for various 




p and n 


combinations 


P 


3 


4 


5 10 


n 


33 


25 


20 10 


mean(count) 


34.0870 


22.3500 


18.0720 14.5790 


std(count) 


5.3870 


3.3162 


2.5477 2.1099 


mean(ratio) 


2.7026 


2.5812 


2.6237 2.7207 


std(ratio) 


1.8483 


1.5455 


1.4768 1.2699 



and standard deviations of the number of iterations needed for convergence 
for various values of the dimension of the matrix, denoted by p and n when 
keeping the total number of cells in the matrix constant.^ Once more our suc- 
cessive normalization procedure is applied to 1000 uniform random starting 
values. Results of this exercise are given in the table below (see Table 2). 

We find that when n and p are close, convergence appears to be faster, 
keeping everything else constant. Interestingly enough, a one-step analysis 
performed for the different scenarios above tends to suggest that the one- 
step ratio, defined as the distance from the first iterate to the limit as a 
proportion of the total distance to the limit from the starting value, seems 
largely unaffected by the row or column dimension of the problem. 

6.2. Example 3: Simulation study on a 5-hy-5 dimensional example. We 
now proceed to further investigate the successive normalization procedure 



^Or approximately constant. 
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when one begins with column mean-standard deviation pohshing fohowed 
by row mean-standard deviation pohshing or vice versa on a simple 5-by-5 
dimensional example. The theory developed in the previous sections proves 
convergence of the successive normalization procedure whether the first nor- 
malization that is performed on the matrix is row polishing or column pol- 
ishing. 

The algorithm took 30 iterations to converge when one begins with column 
mean-standard deviation polishing, and when one begins with row mean- 
standard deviation polishing it took 26 iterations to converge. The initial 
matrix, the final solutions, log relative differences and their respective plots 
for both approaches are given below (see Figure 6): 



(13) 



(14) 



(15) 



/ 0.6565 
0.3099 
0.3316 
0.1882 

V 0.1007 



0.2866 
0.3548 
0.5358 
0.9908 
0.0282 



0.7095 
0.9052 
0.8658 
0.1192 
0.9553 



0.4409 
0.8758 
0.8650 
0.3552 
0.6311 



0.8645 \ 
0.0210 
0.0768 
0.3767 
0.1492 7 



jT-final 

starting with column polishing 



f 1.6360 
0.1093 
-0.6979 
0.2748 

\ -1.3223 



-0.4320 
-1.2446 
1.1193 
1.2421 
-0.6848 



yfinal 

starting with row polishing 



^ 1.4956 
0.3816 
-1.2158 
0.3478 

V -1.0092 



-0.4243 
-0.9267 
1.1775 
1.2181 
-1.0446 



-0.7863 
0.9477 
-0.1716 
-1.3091 
1.3192 



-0.7386 
0.5915 
0.1052 

-1.4096 
1.4514 



-1.0548 
1.2170 
1.1399 
-1.0112 
-0.2907 



-1.1620 
1.3267 
0.9966 
-0.9138 
-0.2475 



0.6371 
-1.0295 
-1.3897 

0.8034 

0.9786 / 



0.8293 
-1.3731 
-1.0634 
0.7573 
0.8499 



Successive Difference 
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25 




X (Iteration number) 



(a) 



10 15 20 25 

X (Iteration number) 



(b) 



Fig. 6. Relative differences at each iteration on the log scale for 5-by-5 dimensional 
example (a) starting with column polishing (b) starting with row polishing. 



/ Difference log(difference) \ 

starting witli starting witli 
Iteration no. column polishing row polishing 

1 2.9646 3.0255 

2 -0.5858 0.2539 

3 -1.5082 -0.8731 

4 -1.8814 -1.4650 

24 -15.0375 -17.0730 

25 -15.7028 -17.8229 

26 -16.3679 -18.5728 

27 -17.0331 

28 -17.6983 

29 -18.3635 

\ 30 -19.0287 - / 



As expected, the final solutions are different. The simulations were re- 
peated with different initial values, and we note that the convergence pat- 
terns (as illustrated in Figure 6) are similar whether the procedure starts 
with column polishing or row polishing, though the actual number of itera- 
tions required to converge can vary. 



6.3. Example 4-' Gene expression data set: 20426-by-63 dimensional ex- 
ample. We now illustrate the convergence of the successive row and column 
mean-standard deviation polishing for a real-life gene expression example, a 
20426-by-63 dimensional example. This dataset arose originally from a study 
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3 4 5 6 

X (Iteration number) 

Fig. 7. Relative differences at each iteration on the log scale — 20426-by-63 dimensional 
example. 

of human in-stent restenosis by Ashley et al. [2]. The algorithm took con- 
siderably longer in terms of time and computer resources but converged in 
eight iterations. The initial matrix and the final are too large to display, but 
the relative (and log relative) differences for the eight iterations are given 
subsequently: 



(17) 



Successive Difference 



ion no. 


Difference 


log (difference) \ 


1 


1.0465 


11.5583 


2 


0.0008 


4.4030 


3 


0.0000 


-0.2333 


4 


0.0000 


-4.7582 


5 


0.0000 


-9.2495 


6 


0.0000 


-13.7130 


7 


0.0000 


-18.1526 


8 


0.0000 


-22.5717 / 



1.0e-F005* 



V 

Note once more how the relative differences decrease linearly on the log scale 
(though empirically) and is once again suggestive of the rate of convergence. 
As both the figure (see Figure 7) and the vector of relative differences indi- 
cates, there is a jump between iteration 1 and 2 and then the curve behaves 
linearly. 
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Additionally the whole procedure takes about 853.2 seconds or approx- 
imately 14.22 minutes on a desktop computer^ versus 0.4 seconds for the 
10-by-lO example. However, the algorithm terminates after only eight it- 
erations. In this example the number of iterations does NOT change with 
the increase in dimensionality. It may make sense to investigate this be- 
havior more thoroughly, empirically, using simulation for rectangular but 
not square matrices. It seems that the ratio of the two dimensions or the 
minimum of the two dimensions may play a role. We should also bear in 
mind that the tolerance level, which is the sum of the individual differences 
squared, has been kept constant at 0.00000001. 

7. Conclusion. In this section we attempt to lend perspective to our re- 
sults and to point the way for future developments. Readers please note that 
for rectangular n x k arrays of real numbers with min(/c,n) > 3, the tech- 
nique beginning with rows (alternatively columns) and successively subtract- 
ing row (column) means and dividing resulting differences, respectively, by 
row (column) standard deviations converges for a subset of Euclidean TZ^^^ 
whose complement has Lebesgue measure 0. The limit is row and column 
exchangeable given the Gaussian probability mechanism that applies in our 
theoretical arguments. We do not offer other information on the nature of 
the exact set on which successive iterates converge. A single "iteration" of 
the process we study has four steps, two each, respectively, for rows and 
columns. Note that on the set for which the algorithm converges, conver- 
gence seems remarkably rapid, exponential or even faster, perhaps because 
after a half an iteration, the rows (alternately columns) lie as n (respectively 
k) points on the surface of the product of relevant unit spheres. Further it- 
erations adjust only directions, not lengths. 

Viewing the squares of the entries as the terms of a backwards martin- 
gale shows maximal inequalities for them, and therefore implicitly contains 
information on "rates of convergence" of the squares; but these easy results 
appear far from the best one might establish. Our arguments for (almost 
everywhere) convergence of the original signed entries do not have infor- 
mation regarding rates of convergence. One argues easily that if successive 
iterates converge, and no limiting entry is 0, then after finitely many steps 
(the number depending on the original values and the limiting values) , signs 
are unchanged. In our examples of small dimension, evidence of this can be 
made explicit. In particular we observe empirically that the vast majority 
of sign changes that are observed do indeed take place in the first few itera- 
tions. Any sign changes that are observed well after the first few iterations 
correspond to sign changes around entries with limiting values close to zero. 



^2 GHz Core 2 Duo processor and 2 GB of RAM. 
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We also have no information on optimality in any sense of the iterated trans- 
formations we study. One reason for our thinking that our topic is inherently 
difficult is that we were unable to view successive iterates as "contractions" 
in any sense familiar to us. 

If we take any original set of numbers, and multiply each number by the 
realized value of a positive random variable with arbitrarily heavy tails, then 
convergence is unchanged. Normalization requires that after half a single 
iteration the same points on the surface of the relevant unit spheres are 
attained, no matter the multiple. The message is that what matters for 
convergence are the distributions induced on the surfaces of spheres after 
each half iteration, and not the otherwise common heaviness of the tails of 
the probability distributions of individual entries. 

Acknowledgments. The authors gratefully acknowledge Bradley Efron 
for introducing them to the research question addressed in the paper. They 
also thank Johann Won and Thomas Quertermous for useful discussions and 
Thomas Quertermous for granting us access to and understanding of data we 
study and reference; Bonnie Chung and Cindy Kirby are also acknowledged 
for administrative assistance. 

REFERENCES 

[1] Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd 
ed. Wiley, Hoboken, NJ. MR1990662 

[2] Ashley, E. A., Ferrara, R., King, J. Y., Vailaya, A., Kuchinsky, A., He, X., 
Byers, B., Gerckens, U., Oblin, S., Tsalenko, A., Soito, A., Spin, J., 
Tabibiazar, R., Connolly, A. J., Simpson, J. B., Grube, E. and Quert- 
ermous, T. (2006). Network analysis of human in-stent restenosis. Circulation 
114 2644-2654. 

[3] DOOB, J. L. (1940). Regularity properties of certain functions of diance variables. 

Trans. Amer. Math. Soc. 47 455-486. MR0002052 
[4] DuRRETT, R. (1995). Probability: Theory and Examples, 2nd ed. Duxbury Press, 

Belmont, CA. MR1609153 
[5] Efron, B. (1969). Student's f-test under symmetry conditions. J. Amer. Statist. 

Assoc. 64 1278-1302. MR0251826 
[6] Feller, W. (1971). An Introduction to Probability Theory and Its Applications 2, 

2nd ed. Wiley, New York. 
[7] Gnedenko, B. V. and Kolmogorov, A. N. (1954). Limit Distributions for Sums 

of Independent Random Variables. Addison- Wesley, Boston, MA. MR0062975 
[8] MuiRHEAD, R. J. (1999). Aspects of Multivariate Statistical Theory. Wiley, New York. 

MR0652932 

[9] SCHEFFE, H. (1999). The Analysis of Variance. Wiley, New York. MR1673563 
[10] ZOLOTAREV, V. M. (1986). One-Dimensional Stable Distributions. Amer. Math. Soc, 
Providence, RI. MR0854867 



SUCCESSIVE NORMALIZATION OF RECTANGULAR ARRAYS 29 



Departments of Health Research and Policy, 

Electrical Engineering, and Statistics 
Stanford University 
Stanford, California 94305-5405 
USA 

E-MAIL: olshcn@stanford.edu 



Department of Statistics 
Stanford University 
Stanford, California 94305-4065 
USA 

E-MAIL: brajarat@stanford.edu 



